Example: shipping books

When you buy a book from Amazon, you get a quote for how much it costs to ship. This is largely based on the weight of the book. If you didn't know the weight of a book, what other characteristics of it could you measure to help predict weight?

Example: shipping books

qplot(x = volume, y = weight, data = books)

Example: shipping books

qplot(x = volume, y = weight, data = books) + 
  geom_abline(intercept = m1$coef[1], slope = m1$coef[2], col = "orchid")

m1 <- lm(weight ~ volume, data = books)
summary(m1)
## 
## Call:
## lm(formula = weight ~ volume, data = books)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -190.0 -109.9   38.1  109.7  145.6 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 107.6793    88.3776    1.22     0.24    
## volume        0.7086     0.0975    7.27  6.3e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 124 on 13 degrees of freedom
## Multiple R-squared:  0.803,  Adjusted R-squared:  0.787 
## F-statistic: 52.9 on 1 and 13 DF,  p-value: 6.26e-06

Q1: What is the equation for the line?

\[ \hat{y} = 107.7 + 0.708 x \]

\[ \widehat{weight} = 107.7 + 0.708 \times volume \]
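To use the line, plug a volume into the fitted equation. A quick check (assumes `m1` and `books` as fitted above; the volume value 1000 is arbitrary, in the data's own units):

```r
# Predicted weight for a book of volume 1000
# (assumes m1 <- lm(weight ~ volume, data = books) from above)
b <- coef(m1)
b[1] + b[2] * 1000                                 # 107.7 + 0.708 * 1000, by hand
predict(m1, newdata = data.frame(volume = 1000))   # same prediction via predict()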

Q2: Is volume a significant predictor?

summary(m1)
## 
## Call:
## lm(formula = weight ~ volume, data = books)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -190.0 -109.9   38.1  109.7  145.6 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 107.6793    88.3776    1.22     0.24    
## volume        0.7086     0.0975    7.27  6.3e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 124 on 13 degrees of freedom
## Multiple R-squared:  0.803,  Adjusted R-squared:  0.787 
## F-statistic: 52.9 on 1 and 13 DF,  p-value: 6.26e-06

Q3: How much of the variation in weight is explained by the model containing volume?

Q4: Does this appear to be a reasonable setting to apply linear regression?

We need to check the "LINE" conditions:

  1. Linear trend
  2. Independent observations/errors
  3. Normal residuals
  4. Equal variance

Residual Plot

qplot(x = .fitted, y = .stdresid, data = m1)

QQ plot

qplot(sample = .stdresid, data = m1) + geom_abline(col = "purple")

Multiple Regression

Multiple Regression

Allows us to model one numerical variable, the response, as a linear function of many explanatory variables, which can be numerical, categorical, or both.

We posit the true model:

\[ Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \ldots + \beta_p X_p + \epsilon; \quad \epsilon \sim N(0, \sigma^2) \]

We use the data to estimate our fitted model:

\[ \hat{y} = b_0 + b_1 x_1 + b_2 x_2 + \ldots + b_p x_p \]

Estimating \(\beta_0, \beta_1\), etc.

In least-squares regression, we're still finding the estimates that minimize the sum of squared residuals.

\[ \sum_{i = 1}^n {e_i}^2 = \sum_{i = 1}^n \left(y_i - \hat{y}_i\right)^2\]

In R:

lm(y ~ x1 + x2 + ... + xp, data = mydata)

Example: shipping books

qplot(x = volume, y = weight, color = cover, data = books)

m2 <- lm(weight ~ volume + cover, data = books)
summary(m2)
## 
## Call:
## lm(formula = weight ~ volume + cover, data = books)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -110.1  -32.3  -16.1   28.9  210.9 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  197.9628    59.1927    3.34  0.00584 ** 
## volume         0.7180     0.0615   11.67  6.6e-08 ***
## coverpb     -184.0473    40.4942   -4.55  0.00067 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 78.2 on 12 degrees of freedom
## Multiple R-squared:  0.927,  Adjusted R-squared:  0.915 
## F-statistic: 76.7 on 2 and 12 DF,  p-value: 1.45e-07

How do we interpret these estimates?

Example: shipping books

MLR slope interpretation

The slope corresponding to the dummy variable tells us:

  • How much vertical separation there is between our lines
  • How much weight is expected to change when cover goes from 0 (hardcover) to 1 (paperback) while volume is held constant.

Each \(b_i\) tells you how much you expect \(y\) to change when \(x_i\) increases by one unit, holding all other variables constant.
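A quick numerical check of the parallel-lines interpretation (assumes `m2` from above; the volume value 1000 is arbitrary):

```r
# The two fitted lines share the volume slope; at any fixed volume they
# differ by exactly the coverpb coefficient (assumes m2 from above)
b <- coef(m2)
hardcover <- b["(Intercept)"] + b["volume"] * 1000
paperback <- b["(Intercept)"] + b["coverpb"] + b["volume"] * 1000
paperback - hardcover   # equals b["coverpb"]
```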

summary(m2)
## 
## Call:
## lm(formula = weight ~ volume + cover, data = books)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -110.1  -32.3  -16.1   28.9  210.9 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  197.9628    59.1927    3.34  0.00584 ** 
## volume         0.7180     0.0615   11.67  6.6e-08 ***
## coverpb     -184.0473    40.4942   -4.55  0.00067 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 78.2 on 12 degrees of freedom
## Multiple R-squared:  0.927,  Adjusted R-squared:  0.915 
## F-statistic: 76.7 on 2 and 12 DF,  p-value: 1.45e-07
  • Is the difference between cover types significant?
  • How much of the variation in weight is explained by a model containing both volume and cover?

summary(m2)$coef
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  197.963    59.1927    3.34 5.84e-03
## volume         0.718     0.0615   11.67 6.60e-08
## coverpb     -184.047    40.4942   -4.55 6.72e-04
qt(.025, df = nrow(books) - 3)
## [1] -2.18

Which of the following represents the appropriate 95% CI for coverpb?

  • A. \(197 \pm 1.96 \times 59.19\)
  • B. \(-184 \pm 2.18 \times 40.5\)
  • C. \(-184 \pm -4.55 \times 40.5\)
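We can check the answer directly in R (assumes `m2` and `books` from above):

```r
# 95% CI for coverpb by hand: estimate +/- t* x SE, with df = n - 3
est <- coef(summary(m2))["coverpb", "Estimate"]
se  <- coef(summary(m2))["coverpb", "Std. Error"]
est + c(-1, 1) * qt(.975, df = nrow(books) - 3) * se
confint(m2, "coverpb")   # the same interval, computed by R
```

Both lines reproduce answer B: \(-184 \pm 2.18 \times 40.5\).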

Extending the model

The two cover types have different intercepts. Do they share the same slope?

Extending the model

m3 <- lm(weight ~ volume + cover + volume:cover, data = books)
summary(m3)
## 
## Call:
## lm(formula = weight ~ volume + cover + volume:cover, data = books)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -89.7  -32.1  -21.8   17.9  215.9 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     161.5865    86.5192    1.87    0.089 .  
## volume            0.7616     0.0972    7.84  7.9e-06 ***
## coverpb        -120.2141   115.6590   -1.04    0.321    
## volume:coverpb   -0.0757     0.1280   -0.59    0.566    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 80.4 on 11 degrees of freedom
## Multiple R-squared:  0.93,   Adjusted R-squared:  0.911 
## F-statistic: 48.5 on 3 and 11 DF,  p-value: 1.24e-06

Do we have evidence that two types of books have different relationships between volume and weight?
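One way to answer this is a nested-model F test comparing the parallel-lines model to the interaction model (assumes `m2` and `m3` from above):

```r
# H0: the interaction coefficient is 0, i.e. parallel lines suffice
anova(m2, m3)
```

Since the models differ by a single term, this F test agrees with the t test on `volume:coverpb` in the summary output.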

Take home messages

  • There is a statistically significant relationship between volume and weight.
  • There is a statistically significant difference in weight between paperback and hardcover books, when controlling for volume.
  • There is not strong evidence that the relationship between volume and weight differs between paperbacks and hardbacks.

This is inference, which required valid models. We'll check diagnostics next time.

qnorm(.025)
## [1] -1.96
qt(.025, df = 13)
## [1] -2.16
qt(.025, df = 14)
## [1] -2.14
qt(.05, df = 13)
## [1] -1.77
qt(.05, df = 14)
## [1] -1.76

Geometry of MLR

Ex: Restaurants in NYC

zagat

Ex: Restaurants in NYC

head(nyc, 3)
##   Case          Restaurant Price Food Decor Service East
## 1    1 Daniella Ristorante    43   22    18      20    0
## 2    2  Tello's Ristorante    32   20    19      19    0
## 3    3          Biricchino    34   21    13      18    0
dim(nyc)
## [1] 168   7

What is the unit of observation?

A restaurant

What determines the price of a meal?

Let's look at the relationship between price, food rating, and decor rating.


With fitted plane


What determines the price of a meal?

\[ Price \sim Food + Decor \]

head(nyc, 3)
##   Case          Restaurant Price Food Decor Service East
## 1    1 Daniella Ristorante    43   22    18      20    0
## 2    2  Tello's Ristorante    32   20    19      19    0
## 3    3          Biricchino    34   21    13      18    0
m1 <- lm(Price ~ Food + Decor, data = nyc)

Model 1: Food + Decor

summary(m1)
## 
## Call:
## lm(formula = Price ~ Food + Decor, data = nyc)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -14.945  -3.766  -0.153   3.701  18.757 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -24.500      4.723   -5.19  6.2e-07 ***
## Food           1.646      0.262    6.29  2.7e-09 ***
## Decor          1.882      0.192    9.81  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.79 on 165 degrees of freedom
## Multiple R-squared:  0.617,  Adjusted R-squared:  0.612 
## F-statistic:  133 on 2 and 165 DF,  p-value: <2e-16

The geometry of regression models

The function for \(\hat{y}\) is . . .

  • A line when you have one continuous \(x\).
  • Parallel lines when you have one continuous \(x_1\) and one categorical \(x_2\).
  • Unrelated lines when you have one continuous \(x_1\), one categorical \(x_2\), and an interaction term \(x_1 : x_2\).

When you have two continuous predictors \(x_1\), \(x_2\), then your mean function is . . .

a plane

3D plot


Location, location, location

Does the price depend on where the restaurant is located in Manhattan?

\[ Price \sim Food + Decor + East \]

head(nyc, 3)
##   Case          Restaurant Price Food Decor Service East
## 1    1 Daniella Ristorante    43   22    18      20    0
## 2    2  Tello's Ristorante    32   20    19      19    0
## 3    3          Biricchino    34   21    13      18    0

Model 2: Food + Decor + East

m2 <- lm(Price ~ Food + Decor + East, data = nyc)
summary(m2)
## 
## Call:
## lm(formula = Price ~ Food + Decor + East, data = nyc)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -14.045  -3.881   0.039   3.392  17.756 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -24.027      4.673   -5.14  7.7e-07 ***
## Food           1.536      0.263    5.84  2.8e-08 ***
## Decor          1.909      0.190   10.05  < 2e-16 ***
## East           2.067      0.932    2.22    0.028 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.72 on 164 degrees of freedom
## Multiple R-squared:  0.628,  Adjusted R-squared:  0.621 
## F-statistic: 92.2 on 3 and 164 DF,  p-value: <2e-16

The geometry of regression models

  • When you have two continuous predictors \(x_1\), \(x_2\), then your mean function is a plane.
  • When you have two continuous predictors \(x_1\), \(x_2\), and a categorical predictor \(x_3\), then your mean function represents parallel planes.

3D Plot


The geometry of regression models

  • When you have two continuous predictors \(x_1\), \(x_2\), then your mean function is a plane.
  • When you have two continuous predictors \(x_1\), \(x_2\), and a categorical predictor \(x_3\), then your mean function represents parallel planes.
  • When you add in interaction effects, the planes become tilted.

Model 3: Food + Decor + East + Decor:East

m3 <- lm(Price ~ Food + Decor + East + Decor:East, data = nyc)
summary(m3)
## 
## Call:
## lm(formula = Price ~ Food + Decor + East + Decor:East, data = nyc)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -13.785  -3.665   0.378   3.729  17.636 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -29.397      6.377   -4.61  8.1e-06 ***
## Food           1.663      0.282    5.90  2.1e-08 ***
## Decor          2.070      0.230    9.01  5.4e-16 ***
## East           9.662      6.218    1.55     0.12    
## Decor:East    -0.435      0.352   -1.24     0.22    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.71 on 163 degrees of freedom
## Multiple R-squared:  0.631,  Adjusted R-squared:  0.622 
## F-statistic: 69.8 on 4 and 163 DF,  p-value: <2e-16

3D plot


Comparing Models

  • The East term was significant in model 2, suggesting that there is a significant relationship between location and price.
  • That term became non-significant when we allowed the slope of Decor to vary with location, and that difference in slopes was also non-significant.
  • Notice that the slope estimate for a given variable will almost always change depending on which other variables are in the model.

The Problem of Model Selection

A given data set can conceivably have been generated from infinitely many models. Identifying the true model is like finding a piece of hay in a haystack. Said another way, the model space is massive and the criterion for what constitutes the "best" model is ill-defined.

The Problem of Model Selection

Best strategy: Use domain knowledge to constrain the model space and/or build models that help you answer specific scientific questions.

Another common strategy:

  1. Pick a criterion for "best".
  2. Decide how to explore the model space.
  3. Select "best" model in search area.

Tread Carefully!!! The second strategy can lead to myopic analysis, overconfidence, and wrong-headed conclusions.

What do we mean by "best"?

While we'd like to find the "true" model, in practice we just hope we're doing a good job at:

  1. Prediction
  2. Description

Synthetic example

How smooth should our model be?

Four candidates

m1 <- lm(y ~ x)
m2 <- lm(y ~ x + I(x^2))
m3 <- lm(y ~ x + I(x^2) + I(x^3))
m4 <- lm(y ~ x + I(x^2) + I(x^3) + I(x^4))

We can add polynomial terms to account for non-linear trends.

Four candidates

\(R^2\)

One way to quantify the explanatory power of a model.

\[ R^2 = \frac{SS_{reg}}{SS_{tot}} \]

This captures the proportion of variability in the \(y\) explained by our regression model.
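A sketch of computing \(R^2\) from its pieces (assumes `m1 <- lm(y ~ x)` from the synthetic example above, with `y` and `x` still in the workspace):

```r
# R^2 = SS_reg / SS_tot, computed by hand
y_hat  <- fitted(m1)
ss_reg <- sum((y_hat - mean(y))^2)
ss_tot <- sum((y - mean(y))^2)
ss_reg / ss_tot          # matches summary(m1)$r.squared
```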

Comparing \(R^2\)

c(summary(m1)$r.squared,
  summary(m2)$r.squared,
  summary(m3)$r.squared,
  summary(m4)$r.squared)
## [1] 0.439 0.919 0.992 0.994

The observed data is best explained by the quartic model. So that's the best model, right?

The BEST model!

The BEST model!

mBEST <- lm(y ~ poly(x, 20))
c(summary(m1)$r.squared,
  summary(m2)$r.squared,
  summary(m3)$r.squared,
  summary(m4)$r.squared,
  summary(mBEST)$r.squared)
## [1] 0.439 0.919 0.992 0.994 0.997

But surely that's not the best model…

Three Criteria

  1. \(R^2\)
  2. \(R^2_{adj}\)
  3. p-values
  • There are many others (\(AIC\), \(BIC\), \(AIC_C\), …).

\(R^2_{adj}\)

A measure of the explanatory power of a model:

\[ R^2 = \frac{SS_{reg}}{SS_{tot}} = 1 - \frac{SS_{res}}{SS_{tot}} \]

But, like the likelihood, \(R^2\) can only increase as predictors are added, so we introduce a penalty for model size.

\[ R^2_{adj} = 1 - \frac{SS_{res}/(n - (p + 1))}{SS_{tot}/(n - 1)} \]
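The penalized formula can be computed directly (a sketch; works for any `lm` fit, e.g. `m4` from above):

```r
# Adjusted R^2 by hand: divide SS_res and SS_tot by their
# respective degrees of freedom before taking the ratio
r2_adj <- function(m) {
  y      <- model.response(model.frame(m))
  ss_res <- sum(resid(m)^2)
  ss_tot <- sum((y - mean(y))^2)
  n <- length(y)
  p <- length(coef(m)) - 1   # number of predictors
  1 - (ss_res / (n - (p + 1))) / (ss_tot / (n - 1))
}
r2_adj(m4)   # matches summary(m4)$adj.r.squared
```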

\(R^2\) vs. \(R^2_{adj}\)

summary(mBEST)$r.squared
## [1] 0.997
summary(mBEST)$adj.r.squared
## [1] 0.994

poverty <- read.delim("poverty.txt", header = TRUE)
m1 <- lm(Poverty ~ Graduates, data = poverty)
summary(m1)
## 
## Call:
## lm(formula = Poverty ~ Graduates, data = poverty)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -4.162 -1.259 -0.218  0.961  5.444 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   64.781      6.803    9.52  9.9e-13 ***
## Graduates     -0.621      0.079   -7.86  3.1e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.08 on 49 degrees of freedom
## Multiple R-squared:  0.558,  Adjusted R-squared:  0.549 
## F-statistic: 61.8 on 1 and 49 DF,  p-value: 3.11e-10

poverty$Noise <- rnorm(nrow(poverty))
m2 <- lm(Poverty ~ Graduates + Noise, data = poverty)
summary(m2)
## 
## Call:
## lm(formula = Poverty ~ Graduates + Noise, data = poverty)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -4.287 -1.314 -0.121  0.971  5.406 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  65.2547     6.8260    9.56  1.1e-12 ***
## Graduates    -0.6259     0.0792   -7.90  3.1e-10 ***
## Noise         0.3165     0.3299    0.96     0.34    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.08 on 48 degrees of freedom
## Multiple R-squared:  0.566,  Adjusted R-squared:  0.548 
## F-statistic: 31.3 on 2 and 48 DF,  p-value: 1.98e-09

Logistic Regression

Building a spam filter

head(email)
##   spam to_multiple from cc sent_email                time image attach dollar winner inherit viagra
## 1    0           0    1  0          0 2011-12-31 22:16:41     0      0      0     no       0      0
## 2    0           0    1  0          0 2011-12-31 23:03:59     0      0      0     no       0      0
## 3    0           0    1  0          0 2012-01-01 08:00:32     0      0      4     no       1      0
## 4    0           0    1  0          0 2012-01-01 01:09:49     0      0      0     no       0      0
## 5    0           0    1  0          0 2012-01-01 02:00:01     0      0      0     no       0      0
## 6    0           0    1  0          0 2012-01-01 02:04:46     0      0      0     no       0      0
##   password num_char line_breaks format re_subj exclaim_subj urgent_subj exclaim_mess number
## 1        0    11.37         202      1       0            0           0            0    big
## 2        0    10.50         202      1       0            0           0            1  small
## 3        0     7.77         192      1       0            0           0            6  small
## 4        0    13.26         255      1       0            0           0           48  small
## 5        2     1.23          29      0       0            0           0            1   none
## 6        2     1.09          25      0       0            0           0            1   none
# where did this data come from / how was it collected?

How was the data collected?

  1. Choose a single email account
  2. Save each email that comes in during a given time frame
  3. Create dummy variables for each text component of interest
  4. Visually classify each as spam or not

Simple Filter A

Predicting spam or not using the presence of "winner"


If "winner" then "spam"?

Simple Filter B

Predicting spam or not using number of characters (in K)

Simple Filter B

Predicting spam or not using log number of characters (in K)

If log(num_char) < 1, then "spam"?
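The threshold rule can be written out and scored in R (a sketch; assumes the `email` data from above, with `num_char` in thousands of characters):

```r
# Classify as spam when log(num_char) < 1, then cross-tabulate
# the rule's predictions against the true labels
pred <- as.numeric(log(email$num_char) < 1)
table(predicted = pred, actual = email$spam)
```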

Challenges

Each simple filter can be thought of as a regression model.

Filter A

\(spam \sim winner; \quad X_1 \sim X_2\)

Filter B

\(spam \sim log(num\_char); \quad X_1 \sim W_1\)

Each one by itself has poor predictive power, so how can we combine them into a single stronger model?

Logistic Regression for B

\[spam \sim log(num\_char)\]

m1 <- glm(spam ~ log(num_char), data = email, family = "binomial")
summary(m1)
## 
## Call:
## glm(formula = spam ~ log(num_char), family = "binomial", data = email)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -1.815  -0.467  -0.335  -0.255   3.013  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)    -1.7244     0.0606   -28.4   <2e-16 ***
## log(num_char)  -0.5435     0.0365   -14.9   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2437.2  on 3920  degrees of freedom
## Residual deviance: 2190.3  on 3919  degrees of freedom
## AIC: 2194
## 
## Number of Fisher Scoring iterations: 5

Interpreting Logistic Regression

  1. Each row of the summary output is still a hypothesis test that the corresponding parameter is 0.
  2. A positive slope estimate indicates a positive association with the log-odds (and hence the probability) of the response.
  3. Each estimate is still interpreted with the other variables held constant.
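To turn the fitted log-odds into a probability, apply the inverse logit (assumes `m1`, the `glm` fit above; the 10K-character value is arbitrary):

```r
# Predicted probability of spam for an email with num_char = 10 (10K characters)
eta <- predict(m1, newdata = data.frame(num_char = 10))              # log-odds scale
plogis(eta)                                                          # inverse logit
predict(m1, newdata = data.frame(num_char = 10), type = "response")  # same, in one step
```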

A more sophisticated model

m2 <- glm(spam ~ log(num_char) + to_multiple + attach + dollar + inherit + 
            viagra, data = email, family = "binomial")
summary(m2)
## 
## Call:
## glm(formula = spam ~ log(num_char) + to_multiple + attach + dollar + 
##     inherit + viagra, family = "binomial", data = email)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -1.931  -0.455  -0.328  -0.236   3.034  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   -1.59642    0.06443  -24.78  < 2e-16 ***
## log(num_char) -0.54869    0.03831  -14.32  < 2e-16 ***
## to_multiple   -1.92889    0.30493   -6.33  2.5e-10 ***
## attach         0.19970    0.06552    3.05   0.0023 ** 
## dollar        -0.00456    0.01898   -0.24   0.8102    
## inherit        0.40003    0.15166    2.64   0.0083 ** 
## viagra         1.73607   40.59296    0.04   0.9659    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2437.2  on 3920  degrees of freedom
## Residual deviance: 2106.3  on 3914  degrees of freedom
## AIC: 2120
## 
## Number of Fisher Scoring iterations: 11

Extending the model

A GLM consists of three things:

  1. A linear predictor
  2. A distribution of the response
  3. A link function between the two
  • MLR : Normal distribution, identity link function

  • Logistic Regression : Binomial distribution, logit link function

  • Poisson Regression : Poisson distribution, logarithm link function
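In R, each of these corresponds to a `family` argument to `glm()` (a sketch; `y`, `x`, and `mydata` are placeholders):

```r
glm(y ~ x, family = gaussian(link = "identity"), data = mydata)  # MLR
glm(y ~ x, family = binomial(link = "logit"),    data = mydata)  # logistic regression
glm(y ~ x, family = poisson(link = "log"),       data = mydata)  # Poisson regression
```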